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A new clustering approach based on mode identification is developed by applying new optimiza- 
tion techniques to a nonparametric density estimator. A cluster is formed by those sample points 
that ascend to the same local maximum (mode) of the density function. The path from a point to 
its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). 
This method is then extended for hierarchical clustering by recursively locating modes of kernel 
density estimators with increasing bandwidths. Without model fitting, the mode-based clustering 
yields a density description for every cluster, a major advantage of mixture-model-based clustering. 
Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnos- 
ing clustering results is also investigated. Specifically, a pairwise separability measure for clusters 
is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved 
for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, 
a cluster merging procedure is created to enforce strong separation. Experiments on simulated and 
real data demonstrate that the mode-based clustering approach tends to combine the strengths of 
linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions 
and when clusters deviate substantially from Gaussian distributions. Both of these cases pose diffi- 
culty for parametric mixture modeling. A C package on the new algorithms is developed for public 
access at http://www.stat.psu.edu/~jiali/hmac. 


Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline 
EM, nonparametric density 


1. Introduction 


Clustering is a technology employed in tremendously diverse areas for a multitude of purposes. It 
simplifies massive data by extracting essential information, based on which many subsequent anal- 
ysis or processes become feasible or more efficient. For instance, in information systems, clustering 
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is applied to text documents or images to speed up indexing and retrieval (Li, 2005a). Clustering 
can be a stand-alone process. For example, microarray gene expression data are often clustered in 
order to find genes with similar functions. Clustering is also the technical core of several prototype- 
based supervised learning algorithms (Hastie et al., 2001) and has been extended to non-vector data 
in this regard (Li and Wang, 2006). Recent surveys (Kettenring, 2006; Jain et al., 1999) discuss the 
methodologies, practices, and applications of clustering. 


1.1 Background 


Clustering methods fall roughly into three types. The first type uses only pairwise distances between 
objects to be clustered. These methods enjoy wide applicability since a tractable mathematical 
representation for objects is not required. However, they do not scale well with large data sets 
due to the quadratic computational complexity of calculating all the pairwise distances. Examples 
include linkage clustering (Gower and Ross, 1969) and spectral graph partitioning (Pothen et al., 
1990). The second type targets on optimizing a given merit function. The merit function reflects 
the general belief about good clustering, that is, objects in the same cluster should be similar to 
each other while those in different clusters be as distinct as possible. Different algorithms vary in 
the similarity measure and the criterion for assessing the global quality of clustering. K-means and 
k-center clustering (Gonzalez, 1985) belong to this type. 


The third type relies on statistical modeling (Fraley and Raftery, 2002). In particular, each 
cluster is characterized by a basic parametric distribution (referred to as a component), for instance, 
the multivariate Gaussian for continuous data and the Poisson distribution for discrete data. The 
overall probability density function (pdf) is a mixture of the parametric distributions (McLachlan 
and Peel, 2000). The clustering procedure involves first fitting a mixture model, usually by the 
EM algorithm, and then computing the posterior probability of each mixture component given a 
data point. The component possessing the largest posterior probability is chosen for that point. 
Points associated with the same component form one cluster. Moreover, the component posterior 
probabilities evaluated in mixture modeling can be readily used as a soft clustering scheme. In 
addition to partitioning data, a probability distribution is obtained for each cluster, which can be 
helpful for gaining insight into the data. Another advantage of mixture modeling is its flexibility in 
treating data of different characteristics. For particular applications, mixtures of distributions other 
than Gaussian have been explored for clustering (Banfield and Raftery, 1993; Li and Zha, 2006). 
Banerjee et al. (2005) have also used the mixture of Mises-Fisher distributions to cluster data on a 
unit sphere. 


The advantages of mixture modeling naturally result from its statistical basis. However, the 
parametric assumptions about cluster distributions are found restrictive in some applications. Li 
(2005b) addresses the problem by assuming each cluster itself is a mixture of Gaussians, providing 
greater flexibility for modeling a single cluster. This method involves selecting the number of 
components for each cluster and is sensitive to initialization. Although some success has been 
shown using the Bayesian Information Criterion (BIC), choosing the right number of components 
for a mixture model is known to be difficult, especially for high dimensional data. 

Another limitation for mixture modeling comes from the sometimes serious disparity between 
a component and a cluster complying to geometric heuristics. If a probability density is estimated 
from the data, preferably, every cluster corresponds to a unique “bump” in the density resulting from 
a tight mass of data. We refer to a bump as a “hill” and the local maximum associated with it the 
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“hilltop”, that is, the mode. The rational for clustering by a mixture model is that if the component 
distributions each possess a single hilltop, by fitting a mixture of them, every component captures 
one separate hilltop in the data. However, this is often not true. Without careful placement and 
control of their shapes, the mixture components may not align with the hills of the density, especially 
when clusters are poorly separated or the assumed parametric component distribution is violated. It 
is known that two Gaussians located sufficiently close result in a single mode. (On the other hand, a 
two component multivariate Gaussian mixture can have more than two modes, as shown by Ray and 
Lindsay, 2005). In this case, equating a component with a cluster is questionable. This profound 
limitation of mixture modeling has not been adequately investigated. In fact, even to quantify the 
separation between components is not easy. 


Here, we develop a new nonparametric clustering approach, still under a statistical framework. 
This approach inherits the aforementioned advantages of mixture modeling. Furthermore, data are 
allowed to reveal a nonparametric distribution for each cluster as part of the clustering procedure. It 
is also guaranteed that every cluster accounts for a distinct hill of the probability density. 


1.2 Clustering via Mode Identification 


To avoid restrictions imposed by parametric assumptions, we model data using kernel density func- 
tions. By their nature, such densities have a mixture structure. Given a density estimate in the form 
of a mixture, a new algorithm, aptly called the Modal EM (MEM), enables us to find an increasing 
path from any point to a local maximum of the density, that is, a hilltop. Our new clustering algo- 
rithm groups data points into one cluster if they are associated with the same hilltop. We call this 
approach modal clustering. A new algorithm, namely the Ridgeline EM (REM), is also developed 
to find the ridgeline linking two hilltops, which is proven to pass through all the critical points of 
the mixture density of the two hills. 


The MEM and REM algorithms allow us to exploit the geometry of a probability density func- 
tion in a nontrivial manner. As a result, clustering can be conducted in accurate accordance with 
our geometric heuristics. Specifically, every cluster is ensured to be associated with a hill, and ev- 
ery sample point in the cluster can be moved to the corresponding hilltop along an ascending path 
without crossing any “valley” that separates two hills. Moreover, by finding the ridgeline between 
two hilltops, the way two hills separate from each other can be adequately measured and exhibited, 
enabling diagnosis of clustering results and application-dependent adjustment of clusters. Modal 
clustering using MEM also has practical advantages such as the irrelevance of initialization and the 
ease of implementing required optimization techniques. 


Our modal clustering algorithm is not restricted to kernel density estimators. In fact, it can be 
used to find the modes of any density in the form of a mixture distribution. It is known that when the 
number of components in a mixture increases, as long as there are sufficiently many components, 
the overall fitted density of the mixture is not sensitive to that number. On the other hand, the 
resulting partition of data can change dramatically if we identify each mixture component with a 
cluster, as normally practiced in mixture-model-based clustering. In modal clustering, there is no 
such identification, and mixture components only play the role of approximating a density. We 
thus have much more flexibility at choosing mixture distributions. Specifically, we adopt the fully 
nonparametric kernel density estimation, using Gaussian kernels for continuous data. 


We summarize the main contributions of this paper as follows: 
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e A new nonparametric statistical clustering algorithm and its hierarchical extension are devel- 
oped by associating data points to their modes identified by MEM. Approaches to improve 
computational efficiency and to visualize cluster structures for high dimensional data are pre- 
sented. 


e The REM algorithm is developed to find the ridgeline between the modes of any two clusters. 
Measures for the pairwise separability between clusters are proposed using ridgelines. A 
cluster merging algorithm to enhance modal clustering is developed. 


e Experiments are conducted using both simulated and real data sets. Comparisons are made 
with several other clustering algorithms such as linkage clustering, k-means, and Gaussian 
mixture modeling. 


1.3 Related Work 


Clustering is an extensively studied research topic with vast existing literature (see Jain et al., 1999; 
Everitt et al., 2001, for general coverage). Works most related to ours are mode-based clustering 
methods independently developed in the communities of pattern recognition (Leung et al., 2000) 
and statistics (Cheng et al., 2004). The MEM and REM algorithms, the cluster diagnosis tool, and 
cluster merging procedure built upon REM are unique to our work. 

In pattern recognition, mode-based clustering is studied under the name of scale-space method, 
inspired by the smoothing effect of the human visual system. The scale-space clustering method 
is pioneered by Wilson and Spann (1990), and furthered studied by Roberts (1997) using density 
estimation and by Chakravarthy and Ghosh (1996) using the radial basis function neural network. 
Leung et al. (2000) forms a function called “space scale image” for a data set. This function is es- 
sentially a Gaussian kernel density estimate (differing from a true density by an ignorable constant). 
The modes of the density function are solved by numerical methods. To associate a data point with 
a mode, a gradient dynamic system starting from the point is defined and solved by the Euler differ- 
ence method. A hierarchical clustering algorithm is proposed by gradually increasing the Gaussian 
kernel bandwidth. The authors also note the non-nested nature of clustering results obtained from 
increasing bandwidths. 

It can be shown that the iteration equation derived from the Euler difference method is identi- 
cal to that from MEM. However, MEM applies generally to any mixture of Gaussians as well as 
mixtures of other parametric distributions. Its ascending property is proved rather than based on 
approximation. Under the framework of scale-space clustering, general mixtures do not arise as a 
concern, and naturally, only the case of Gaussian kernel density is discussed (Leung et al., 2000). 
It is not clear whether the gradient dynamic system can be efficiently solved for general mixture 
models. Our hierarchical clustering algorithm differs slightly from that of Leung et al. (2000) by 
enforcing nested clustering. This difference only reflects an algorithmic preference and is not in- 
trinsic to the key ideas of modal clustering. 

Cheng et al. (2004) defined a gradient tree to be the set of steepest ascent curves of a kernel 
density estimate, treating each sample point as the starting position of a curve. The gradient curves 
are similar to the paths solved by the gradient dynamic system of Leung et al. (2000), but are 
computed by discrete approximation. Minnotte and Scott (1993) developed the mode tree as a 
visualization tool for nonparametric density estimate. The emphasis is on graphically illustrating the 


1690 


NONPARAMETRIC MODAL CLUSTERING 


relationship between kernel bandwidths and the modes of uni- or bivariate kernel density functions. 
Minnotte et al. (1998) also extended the mode tree to the mode forest. 

Because clustering of independent vectors has been a widely used method for image segmen- 
tation, especially in the early days (Jain and Dubes, 1988), we test the efficiency of our algorithm 
on segmentation, a good example of computationally intensive applications. We have no intention 
to present our clustering algorithm as a state-of-the-art segmentation method although it may well 
be applied as a fast method. We note that much advance has been achieved in image segmentation 
using approaches beyond the framework of clustering independent vectors (Pal and Pal, 1993; Shi 
and Malik, 2000; Joshi et al., 2006). 

The rest of the paper is organized as follows. Notations and the MEM algorithm are introduced 
in Section 2. In Section 3, a new clustering algorithm and its hierarchical extension are developed 
by associating data points with the modes of a kernel density estimate. In Section 4, the REM algo- 
rithm for finding ridgelines between modes is presented. In addition, several measures of pairwise 
separability between clusters are defined, which lead to the derivation of a new cluster merging 
algorithm. This merging method strengthens the framework of modal clustering. In Section 5, we 
present a method to visualize high dimensional data so that the discrimination between clusters is 
well preserved. Experimental results on both simulated and real data sets and comparisons with 
other clustering approaches are provided in Section 6. Finally, we conclude and discuss future work 
in Section 7. 


2. Preliminaries 


We introduce in this section the Modal EM (MEM) algorithm that solves a local maximum of a 
mixture density by ascending iterations starting from any initial point. The algorithm is named 
Modal EM because it comprises two iterative steps similar to the expectation and the maximization 
steps in EM (Dempster et al., 1977). The objective of the MEM algorithm is different from the 
EM algorithm. The EM algorithm aims at maximizing the likelihood of data over the parameters of 
an assumed distribution. The goal of MEM is to find the local maxima, that is, modes, of a given 
distribution. 

Let a mixture density be f(x) = Z$; mfx(x), where x € R, Tp is the prior probability of 
mixture component k, and f(x) is the density of component k. Given any initial value x), MEM 
solves a local maximum of the mixture by alternating the following two steps until a stopping 
criterion is met. Start with r = 0. 


1. Let (o) 
Tk flx") 
= ———_ k= l,...,K. 
Pk F(x) ) genes 
2. Update 
K 
xt) = argmax > prlog fx(x). 
X k= 


The first step is the “Expectation” step where the posterior probability of each mixture com- 
ponent k, 1 < k < K, at the current point x is computed. The second step is the “Maximiza- 
tion” step. We assume that )*_, py log f(x) has a unique maximum, which is true when the f;(x) 
are normal densities. In the special case of a mixture of Gaussians with common covariance ma- 
trix, that is, f(x) = O(x | ux,Z), where 0(-) is the pdf of a Gaussian distribution, we simply have 
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xt) = yy pug. For other parametric densities f(x), the solution to the maximization in the 
second step can be more complicated and sometimes requires numerical procedures. On the other 
hand, similarly as in the EM algorithm, it is usually much easier to maximize yee prlog f(x) than 
the original objective function log Y*_, T f(x). 

The proof of the ascending property of the MEM algorithm is provided in Appendix A. We omit 
a rigorous discussion regarding the convergence of x”) here. By Theorem 1 of Wu (1983), if f (x) 
is a mixture of normal densities, all the limit points of {x} are stationary points of f(x). It is 
possible that {xC )} converges to a stationary, but not locally maximal, point, although we have not 
observed this in our experiments. We refer to (Wu, 1983) for a detailed treatment of the convergence 
properties of EM style algorithms. 


3. Clustering by Mode Identification 


We focus on clustering continuous vector data although the framework extends readily to discrete 
data. Given a data set {x1,X2, ...,Xn }, X; € RI, a probability density function for the data is estimated 
nonparametrically using Gaussian kernels. As the kernel density estimate is in the form of a mixture 
distribution, MEM is applied to find a mode using every sample point x;, i = 1,...,n, as the initial 
value for the iteration. Two points x; and x; are grouped into one cluster if the same mode is obtained 
from both. When the variances of Gaussian kernels increase, the density estimate becomes smoother 
and tends to group more points into one cluster. A hierarchy of clusters can thus be constructed by 
gradually increasing the variances of Gaussian kernels. Next, we elaborate upon the clustering 
algorithm, illustrate it with an example, and discuss approaches to speed up computation. 


3.1 The Algorithm 
Let the set of data to be clustered be S = {x1,x2,...,Xn}, Xi € RI., The Gaussian kernel density 


estimate is formed: 


f(x) = O(a xE), 


sie 


i=1 


where the Gaussian density function 





Hog Ga). 


1 
9 1,2) = aaa Oa 


We use a spherical covariance matrix © = diag(o*,o7,...,67). The standard deviation © is also 
referred to as the bandwidth of the Gaussian kernel. We use notation D(o*) = diag(o*, 0”, ...,07) 
for brevity. 

With a given Gaussian kernel covariance matrix D(o7), data are clustered as follows: 


1. Form kernel density 


Lol [x D(0), D 


Ms 


falSio)= 


l 


Il 
= 


2. Use f(x|S,o7) as the density function. Use each x;, i = 1,2,...,n, as the initial value in the 
MEM algorithm to find a mode of f(x|S,o7). Let the mode identified by starting from x; be 
Mo(xi). 


1692 


NONPARAMETRIC MODAL CLUSTERING 


3. Extract distinctive values from the set {Mo(x;),i = 1,2,...,n} to form a set G. Label the 
elements in G from 1 to |G]. In practice, due to finite precision, two modes are regarded equal 
if their distance is below a threshold. 


4. If Mo(xi) equals the kth element in G, x; is put in the kth cluster. 


In the basic version of the algorithm, the density f(x|S,07) is a sum of Gaussian kernels centered 
at every data point. However, the algorithm can be carried out with any density estimate in the form 
of a mixture. The key step in the clustering algorithm is the identification of a mode starting from 
any xi. MEM moves from x; via an ascending path, or, figuratively, via hill climbing, to a mode. 
Points that climb to the same mode are located on the same hill and hence grouped into one cluster. 
We call this the Mode Association Clustering (MAC) algorithm. 

Although the density of each cluster is not explicitly modeled by MAC, this nonparametric 
method retains a major advantage of mixture-model-based clustering, that is, a pdf is obtained 
for each cluster. These density functions facilitate soft clustering as well as cluster assignment of 
samples outside the data set. Denote the set of points in cluster k, 1 < k < |G|, by Cy. The density 
estimate for cluster k is 


a= F Zoe lD). (2) 


xi:xiECk 


Because we do not assume a parametric form for the densities of individual clusters, our method 
tends to be more robust and characterizes clusters more accurately when the attempted parametric 
assumptions are violated. 

It is known in the literature of mixture modeling that if the density of a cluster is estimated using 
only points assigned to this cluster, the variance tends to be under estimated, although the effect on 
clustering may be small (Celeux and Govaert, 1993). The under estimation of variance becomes 
more severe for poorly separated clusters, which often decay towards zero too quickly on leaving 
the cluster. We will see a similar phenomenon here with g(x) having over fast decaying tails. A 
correction to this problem in mixture modeling is to use soft instead of hard clustering. Every point 
is allowed to contribute to every cluster by a weight computed from the posterior probability of the 
cluster. 

Under this spirit, we can make an ad-hoc modification on the density estimation. With g(x) in 
(2) as the initial cluster density, compute the posterior of cluster k given each x; by pix « Glej (x), 


k= 1, ..., |G|, subject to E Pix = 1. Form the updated density of cluster k by 


Eii PiKO(X | xi, D(0°)) 
ini Pik l 





&x(x) = 


With the cluster density modified, it is natural to question if p;x should be updated again, which 
in turn will lead to another update of (x). This iterative procedure can be carried out infinitely. 
Whether it converges is not clear and may be worthy of investigation. On the other hand, if the 
maximum a posteriori clustering based on the final &;,(x) differs significantly from the result of 
modal clustering, this procedure may have defeated the very purpose of modal clustering and turned 
it into merely an initialization scheme. We thus do not recommend many iterative updates on %;(x). 
One round of adjustment from g(x) may be sufficient. Or one can take g(x) cautiously as a smooth 
signature of a cluster, likely tighter than an accurate density estimate. 
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When the bandwidth o increases, the kernel density estimate f(x|S,o7) in (1) becomes smoother 
and more points tend to climb to the same mode. This suggests a natural approach for hierarchical 
clustering. Given a sequence of bandwidths 0; < 62 < --- < Gy, hierarchical clustering is performed 
in a bottom-up manner. We start with every point x; being a cluster by itself. The set of cluster 
representatives is thus Go = S = {x1,...,%n}. This extreme case corresponds to the limit when © 
approaches 0. At any bandwidth 6), the cluster representatives in G;_; obtained from the preceding 
bandwidth are input into MAC using the density f(x|S,o7). Note that the kernel centers remain 
at all the original data points although modes are identified only for cluster representatives when 
l > 1. The modes identified at this level form a new set of cluster representatives G;. This procedure 
is repeated across all 0;’s. We refer to this hierarchical clustering algorithm as Hierarchical MAC 
(HMAC). It corresponds to the mappings x; —> Mg, (xi) > Mo, (Moy, (xi)) > 

Denote the partition of points obtained at bandwidth ©; by #, a function mapping x;’s to cluster 
labels. If K clusters labeled 1, 2, ..., K, are formed at bandwidth ©;, ;(x;) € {1,2,...,K}. HMAC 
ensures that ;’s are nested, that is, if P;(x;) = ®(x;), then P41(xi) = P41(x;). Recall that the 
set of cluster representatives at level / is G;. HMAC starts with Go = {x1,...,x,} and solves Gy, 
1=1,2,...,n, sequentially by the following procedure: 


1. Form kernel density 
n 
1 
f(x|S,67) ZLM x | xi, D(07)). 


n 


2. Cluster G;_; by MAC using density f(x|S,07). Let the set of distinct modes obtained be Gj. 


3. If ®_1(x;) =k and the kth element in G;_, is clustered to the k’th mode in G}, then P;(x;) = K’. 
That is, the cluster of x; at level / is determined by its cluster representative in Gj_1. 


It is worthy to note that HMAC differs profoundly from linkage clustering, which also builds a 
hierarchy of clusters. In linkage clustering, at every level, only the two clusters with the minimum 
pairwise distance are merged. The hierarchy is constructed by a sequence of such small greedy 
merges. The lack of overall consideration tends to result in skewed clusters. In HMAC, however, at 
any level, the merging of clusters is conducted globally and the effect of every original data point 
on clustering is retained through the density f(x|S,07). 


3.2 An Example 


We now illustrate the HMAC algorithm using a real data set. This is the glass identification data 
provided by the UCI machine learning database repository (Blake et al., 1998). The original data 
were contributed by Spiehler and German at the Diagnostic Products Corporation. For clarity of 
demonstration, we take 105 sample points from two types of glass in this data set. Moreover, we 
only use the first two principal components of the original 9 dimensions. 

HMAC is applied to the data using a sequence of kernel bandwidths 6; < 62 <--- < Oy, N = 20, 
chosen equally spaced from [0.16,26] = [0.225,4.492], where ô is the larger one of the sample 
standard deviations of the two dimensions. Among the 20 different o;’s, only 6 of them result in 
clustering different from 6)_1, reflecting the fact that the bandwidth needs to increase by a sufficient 
amount to drive the merging of some existing cluster representatives. The number of clusters at the 
6 levels is sequentially 21, 11,5, 3, 2, 1. 
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We demonstrate the clustering results at the 2nd and 3rd level in Figure 1. At the 2nd level, 11 
clusters are formed, as shown by different symbols in Figure 1(a). The 11 modes identified at level 
2 are merged into 5 modes at level 3 when the bandwidth increases from 0.449 to 0.674. Figure 1(b) 
shows the ascending paths generated by MEM from the 11 modes (crosses) at level 2 to the 5 modes 
(squares) at level 3. The contour lines of the density function with the corresponding bandwidth of 
level 3 are plotted in the background for better illustration. The 5 clusters at level 3 are shown in 
Figure 1(c). These 5 modes are again merged into 3 modes at level 4, as shown in Figure 1(d). 


3.3 Measures for Enhancing Speed 


Because the nonparametric density estimate in (1) is a sum of kernels centered at every data point, 
the amount of computation to identify the mode associated with a single point grows linearly with 
n, the size of the data set. The computational complexity of clustering all the data by MAC is thus 
quadratic in n. In practice, however, it is often unnecessary to use the basic kernel estimate. A 
preliminary clustering can be first applied to {x1,...,x,} to yield m clusters, where m is significantly 
smaller than n, but still much larger than the desired number of clusters. Suppose the m cluster 
centroids are § = {X1,...,X%m} and the number of points in cluster Sj isnj, j =1,2,...,m. We use the 


density estimate 
m 


f(e| 5,D(6)) = È Fo z),D(6")) 
j=l 
in MAC to cluster the x;’s. Since MEM applies to general mixture models, the modified density 
function causes no essential changes to the clustering procedure. 

The purpose of the preliminary clustering is more of quantizing the data than clustering. Com- 
putation is reduced by not discerning points in the same quantization region when formulating the 
density estimate. If m is sufficiently large, S is adequate to retain the topological structures in the 
nonparametric density estimate. In this fast version of MAC, we search for a mode for every 5}. 
Examples exploiting the fast MAC are given in Section 6. 


4. Analysis of Cluster Separability via Ridgelines 


A measure for the separability between clusters is useful for gaining deeper understanding of clus- 
tering structures in data. With this measure, the difference between clusters is quantified, rather 
than being simply categorical. This quantification can be useful in certain situations. For instance, 
in taxonomy study, after grouping instances into species, scientists may need to numerically assess 
the disparity between species, often taken as an indicator for evolutionary proximity. A separability 
measure between the clusters of species can effectively reflect such disparity. Such a measure is also 
useful for diagnosing clustering results and for the mere interest of designing clustering algorithms. 
Based upon it, we derive a mechanism to merge weakly separated clusters. Although the separabil- 
ity measure is a diagnostic tool and the cluster merging method can adjust the number of clusters, in 
this paper, we do not pursue the problem of choosing the number of clusters fully automatically. It 
is well known that determining this number is a deep problem, and domain knowledge often needs 
to be called upon for a final decision in various applications. 

The separability measure we define here exploits the geometry of the density functions of two 
clusters in a comprehensive manner. We only require the cluster pdf to be a mixture distribution, 
for example, a Gaussian kernel density estimate. Before formulating the separability measure, we 
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Figure 1: Clustering results for the glass data set obtained from HMAC. (a) The 11 clusters at level 
2. (b) The MEM ascending paths from the modes at level 2 (crosses) to the modes at 
level 3 (squares), and the contours of the density estimate at level 3. (c) The 5 clusters 
at level 3. (d) The ascending paths from the modes at level 3 (crosses) to those at level 4 
(squares) and the contours of the density estimate at level 4. (e) Ridgelines between the 
3 major clusters at level 3. (f) The density function along the 3 ridgelines. 
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define a ridgeline between two unimodal clusters. The REM algorithm is developed to solve the 
ridgeline. 


4.1 Ridgeline 


The ridgeline between two clusters with density g; (x) and g2(x) is 
L= {x(a) : (1—a)V log g1(x) + aV log go(x) =0,0<a< 1}. (3) 


For a mixture density of the two clusters, (x) = migi (x) + Mogo(x), M1 +m = 1, if g(x) > 0 
for any x, the modes, antimodes (local minimums), and saddle points of g(x) must occur in £ for 
any prior probability mı. The locations of these critical points, however, depend on 71. This fact is 
referred to as the critical value theorem and is proved by Ray and Lindsay (2005). 

Remarks: 


1. Eq. (3) is precisely the critical point equation for the exponential tilt density g(x|a) = 
c(a)g1(x)!~%go(x)%, where c() is a normalizing constant. This density family is an ex- 
ponential family, with sufficient statistic log(g2(x)/gi(x)). 


2. The set of solutions in £ is, in general, a 1-dimensional manifold; that is, a curve. When both 
gı and g2 are normal densities, the solution is explicit (see Ray and Lindsay, 2005), and the 
solutions form a unique one-dimensional curve. More generally, the solutions are possibly a 
set of curves that pass through the modes of the g; (x) and go(x). 


3. If both gı and gz are unimodal and have convex upper contour sets, it can be proved that the 
solutions form a unique curve between the modes of gı and g3 respectively. In our discussion, 
we assume unimodal gı and go. 


Since the local maxima of the exponential tilt function g(x; a) satisfy Eq. (3), we solve (3) by 
maximizing log(g(x|a)) = (1 — a) log g(x) + alog g2(x). In the case when the two cluster densi- 
ties gı and g2 are themselves mixtures of basic parametric distributions, for example, normal, we 
develop an ascending algorithm to maximize the function, referred to as the Ridgeline EM (REM) 
Algorithm. For notational brevity, assume that both gı and g2 are mixtures of T parametric distri- 
butions: 


T 
gi(x) = Ł Ti xhi (x) 3 i= 1,2 i 
K=1 


Starting from an initial value x), REM updates x by iterating the two steps: 
1. Compute 


T 
Pix = Tj in (x) / Ł T; jhi j(x") , K= I ESR i ,l= 1,2 : 
j=l 
2. Update x"): 


T T 
x) = argmax(1 — a) £ pıxloghı x(x)+a Ł p2xloghzx(x) . 
x k=1 


k=1 
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REM ensures that g(x”) | æ) > g(x | a). Proof is given in Appendix B. As with MEM, 
we will not rigorously study the convergence properties of the sequence {xC Da In the special case 
hix(x) = 0(x | uix, £), the update equation for xt!) becomes 


T T 
ger) = (1 = a) Jo Pi xlix +O > P2,«H2,« + 
K=1 K=1 
The Gaussian kernel density estimate belongs to this case. 

At the two extreme values & = 0, 1, the solutions are the modes of g; (x) and g2(x) respectively. 
We solve x(&) sequentially on a set of grid points 0 = Gp < Qı < +: < Qç = 1. First, x(0) = 
argmax, g1(x) is solved by MEM. For every 0, x(@ı—1), previously calculated, is used as initial 
value to start the iterations in REM. 

Suppose two clusters, denoted by zı and z2, have densities gı and go, and prior probabilities 7 
and 72 respectively. We define a pairwise separability as 


mina T181 (x(O)) + m82(x(0)) _ igi (x(ae")) + T282(x(0*)) 
™181(x(0)) + %282(x(0)) ™181(x(0)) + %2g2(x(0)) 


where &* = argmin, 7121 (x(0)) + M2g2(x(a)). Usually, the prior probabilities m; or T2 are propor- 
tional to the cluster sizes. It is obvious that S(z1,z2) € [0, 1]. To symmetrize the measure, we define 
the pairwise symmetric separability as S(z1,z2) = min|S(z1,z2),5(z2,21)]. 

By finding x(a@*), we can evaluate the amount of “dip” along the ridgeline. By the critical 
value theorem, the minimum of mig (x) + %2g2(x), if exists, lies on the ridgeline and therefore 
must be x(a*). Hence, if there is a “dip” in the mixture of the two clusters, it will be captured 
by the ridgeline. According to definition (4), a deeper “dip” leads to higher separability. In our 
implementation, we approximate a* by 








S(z1,Z2) =1 (4) 


*  argmin 7121 (x(a)) + M2g2(x(O)) , 
Oy ,0<1<C 


where o, O < I < Ç, are the grid points. 

We also define the separability for a single cluster to quantify its overall distinctness from other 
clusters. Specifically, suppose there are m clusters denoted by z;, i = 1,...,m. The separability of 
cluster z;, denoted by s(z;) is defined by 

s) = in ene)» 
We call a cluster “insignificant” if its separability is below a given threshold £. In our discussion, 
e=0.5. 

Take the glass data set as an example. As shown in Figure 1(c), the points are divided into 5 
groups at that level of the clustering hierarchy. Two of the clusters each contain a single sample 
point, which is far from the other points and forms a mode alone. The separability of these two 
clusters are weak, respectively 0.00 and 0.30. The three other clusters are highly separable, with 
separability values 0.94, 0.84 and 0.81. Figure 1(e) shows the ridgelines between any two of the 
three significant clusters, and Figure 1(f) shows the density function along the ridgelines, normalized 
to one at the ridgeline end point x(0) or x(1) (depending on whichever is larger). 

The task of identifying insignificant clusters in Figure 1(c) is not particularly challenging be- 
cause the two smallest clusters are “singletons” (containing only one sample). However, cluster size 
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is not the sole factor affecting separability. Take the clusters in Figure 1(a) as an example. The 
separability of the 11 clusters and their corresponding sizes (number of points contained) are listed 
in Table 1. It is shown that although cluster 9 is the third largest cluster, its separability is low. At 
threshold € = 0.5, this cluster is insignificant. In contrast, by being far from all the other clusters, 
cluster 6, a singleton, has separability 0.87. The low separability of cluster 9 is caused by its prox- 
imity to cluster 1, the largest cluster which accounts for 60% of the data. Clusters that contain a 
large portion of data tend to “absorb” surrounding smaller clusters. The attraction of a small cluster 
to a bigger one depends on its relative size, tightness, distance to the bigger cluster, as well as the 
orientation of the data masses in the two clusters. 





Cluster 1 2 3 4 5 6 7 8 9 10 11 
Size 63 1 1 2 4 1 1 1 5 1 25 
Separability | 0.94 | 0.41 | 0.41 | 0.31 | 0.68 | 0.87 | 0.13 | 0.13 | 0.18 | 0.46 | 0.92 


















































Table 1: Separability between clusters in the glass data set 


4.2 Merging Clusters Based on Separability 


To elaborate on the relationships between clusters, we compute the matrix of separability between 
any pair of clusters. We can potentially use this matrix to decide which clusters can be merged due 
to weak separation. As discussed previously, one way to merge clusters is to increase the bandwidth 
of the kernel function used by HMAC. However, an enlarged bandwidth may cause prominent clus- 
ters to be clumped while leaving undesirable small “noisy” clusters unchanged. Merging clusters 
according to the separability measure is one possible approach to eliminate “noisy” clusters without 
losing important clustering structures found at a small bandwidth. We will show by the glass data 
that the merging method makes clustering results less sensitive to bandwidth. 


Let the clusters in consideration be {z1,z2,...,Zm}. We denote the pairwise separability between 
cluster z; and z; in short by S; j, where S; į = S(Zi,Z;). Note in general S; j 4 Sji. Let a threshold for 
separability be £, 0 < € < 1. Let the density function of cluster z; be g;(-) and the prior probability 
be 7;. Denote the weighted mode of each cluster density function by 6(z;) = 7; max g;(x). Since in 
MEM the mode max g;(x) for each cluster z; is computed when z; is formed, 6(z;) requires no extra 
computation after clustering. We refer to 8(z;) as the significance index of cluster zi. 


The main idea of the merging algorithm is to have clusters with a higher significance index 
absorb other clusters that are not well separated from them and are less dominant (lower significance 
index). Several issues need to be resolved to avoid arbitrariness in merging. First, a cluster z; may be 
weakly separated from several other clusters with higher significance indices. Among those clusters, 
we let the one from which z; is worst separated to absorb z;. Second, two weakly separated clusters 
zi and z; may have the same significance indices, that is, 6(z;) = 6(z;); and hence it is ambiguous 
which cluster should be treated as the dominant one. We solve this problem by introducing the 
concept of cliques. The clusters are first grouped into cliques which contain weakly separated z,’s 
with the same value of 5(z;). Clusters in different cliques are ensured to be either well separated or 
have different significance indices. We then apply the absorbing process to the cliques, without the 
possibility of encountering the aforementioned ambiguity. 
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Next, we describe how to form cliques and extend the definition of pairwise separability to 
cliques. In order to carry out the merging of cliques, a directed graph is constructed upon the 
cliques based on pairwise separability and the comparison of significance indices. 


1. Tied: Cluster z; and z; are defined to be tied if S(z;,z;) < € and 8(z;) = 8(z;). 


2. Clique: Cluster z; and z; are in the same clique if (a) z; and z; are tied, or (b) there is a cluster 
zk such that z; and zę are in the same clique, and z; and zg are in the same clique. 


Remark: the relationship “tied” results in a partition of z;’s. Each group formed by the par- 
tition is a clique. Because being tied requires 5(z;) = 5(z;), in practice, we only observe 
clusters being tied when they are all singletons. 


We now define the separability between cliques. Without loss of generality, let clique cı = 
{21,225 +++) Zm, } and c2 = {Zm +1; -Zm +m }. Denote the clique-wise separability by S.(c1,c2): 


Se(c1,c2) min min SRE) 
1<i<m m +1<j<m +m 
Since in general, S(z;,z;) A S(zj,zi), the asymmetry carries over to Se(c1,c2) Æ Se(c2,c1). We 
also denote the significance index of a clique c; as 8(c;). Since all the clusters in c; have equal 
significance indices, we let 5(c;) = 6(zj), where zy is any cluster included in c;. 
Regard each clique as a node in a graph. Suppose there are ñ cliques {c1,...,cj}. A directed 
graph is built as follows. For two arbitrary cliques c; and c;, a link from c; to c; is made if 


1. Selci,cj) <€. 


2. Selci cj) = mingzi Se(Ci,ck) and j is the smallest index among all those j’’s that achieve 
Selci, Cj) = mings Se(Ci, Ck). 


3. S(ci) < 8(c;). 


A clique c; is said to be linked to c; if there is a directed edge from c; to cj. It is obvious by 
the second requirement in the link construction that every clique is linked to at most another clique. 
In Appendix C, it is proved that a graph built by the above rules has no loops. An example graph 
is illustrated in Figure 2(b). If we disregard the directions of the links, the graph is partitioned into 
connected subgraphs. In the given example in Figure 2(b), there are four connected subgraphs. The 
basic idea of the merging algorithm is to let the most dominant clique in one subgraph absorb all 
the others in the same subgraph. 

We call clique c; the parent clique of c; if there is a directed link from c; to cj. In this case, we 
also say c; is directly absorbed by cj. By construction, 5(c;) < 6(c;). More generally, if there is a 
directed path from c; to c;, then c; is called an ancestor of c;, and c; is absorbed by c;. Again, we 
have 8(c;) < 8(c;) by transitivity. In each connected subgraph containing k nodes, because there 
is no loop, there are exactly k— 1 links. Since every node has at most one link originating from 
it, the k— 1 links have to originate from k — 1 different nodes. Therefore, there is precisely 1 node 
in each connected subgraph that has no link originating from it. This node is called the head node 
(clique) of the connected nodes. It is not difficult to see that the head node is an ancestor for all the 
other nodes in the subgraph. As a result, the significance index of the head clique is strictly larger 
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than that of any other clique in the subgraph. In this sense, the head clique dominates all the other 
connected cliques. 

Combining clusters by the above method is the first and main step in our merging algorithm. 
To account for outliers, the second step in the algorithm employs the notation of coverage rate. 
Outlier points far from all the essential clusters tend to yield high separability and hence will not be 
merged. In HMAC, to “grab” those outliers, the kernel bandwidth needs to grow so large that under 
such a bandwidth, many significant clusters are undesirably merged. To address this issue, we find 
the smallest clusters and mark them as outliers if their total size proportional to the entire data set is 
below a threshold. For instance, if the coverage rate allowed is 95%, this threshold is then 5%. 

We now summarize our cluster merging algorithm as follows. We call the merging procedure 
conducted based on the separability measure stage I merging and that based on coverage rate stage II 
merging. The two stages do not always have to be concatenated. We can apply each alone. Applying 
only stage I is equivalent to applying two stages and setting the coverage rate to 100%; applying 
only stage II is equivalent to setting the threshold of separability to 0.0. Assume the starting clusters 
are {Z1,22,.-.,Zm}- 


1. Stage I: merging based on separability. 
(a) Compute the separability matrix [S;;], i,j = 1,...,m, and the significant index 6(z;), 
i=1,...,m. 
(b) Form cliques {c1,c2,...,¢m} based on [S; j] and 6(z;)’s, where m < m. Record the z;’s 
contained in each clique. 
(c) Construct the directed graph. 


(d) Merge cliques that are in the same connected subgraph. z;’s contained in merged cliques 
are grouped into one cluster. Denote those merged clusters by {21,22,..., 2m}, where 
M <M. 


2. Stage II: merging based on coverage rate. Denote the coverage rate by p. 


(a) Calculate the sizes of clusters {2),22,...,2} and denote them by Ĥ;, i = 1,...,7%. The 
size of the whole data set isn =)" ; Aj. 


A 


(b) Sort ñ;, i = 1,...,m, in ascending order. Let the sorted sequence be fi(1), Ai(2), -> A(a)- 
Let (o) = 0. Let k be the largest integer such that Lo Ai <(1—p)n. 

(c) If k > 0, go to the next step. Otherwise, stop and the final clusters are {21,2o,..., 2m}. 

(d) For each 2(;), i= 1,...,k, find all the original clusters z;’s that are merged into 2(;). Denote 
the index set of the z;’s by Hi). Let H' = Uk A and H” = UÊ Aj). 


i=k+1 
(e) For each Z(;), i= 1,...,k, find j* = argmin jey” minjeH,) S(zj,z;). Find the cluster 2; that 
contains zj». Merge Z(;) with 2. The new clusters obtained are {21,22,.--,Zm}, where 
m=m—k<m. 


(f) Reset m — m and Z; > 2;,i = 1,...,m. Go back to step (a). 


Unless there is a definite need to assign every data point to one of the major clusters, in certain ap- 
plications, it may be more preferable to keep the outlier status of some points rather than allocating 
them to distant clusters. 
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Figure 2: The process of merging clusters for the glass data set. (a) The clustering result after 
merging clusters in the same cliques. (b) The cliques and directed graph constructed 
based on separability. (c), (d) The clustering results after stage I and stage II merging 
respectively. 


The first stage merging based on separability is intrinsically connected with linkage-based ag- 
glomerative clustering. For details on linkage clustering, see Jain et al. (1999). In a nutshell, 
linkage clustering forms clusters by progressively merging a pair of current clusters. Initially, 
every data point is a cluster. The two clusters with the minimum pairwise distance are chosen 
to merge at each step. The procedure is greedy in nature since minimization is conducted se- 
quentially through every merge. Linkage clustering methods differ by the way between-cluster 
distance is updated when a new cluster is combined from two smaller ones. For instance, in sin- 
gle linkage, if cluster €) and €3 are merged into &4, the distance between &; and &4 is calculated 
as d(&1,€4) = min(d(€),&),d(€1,€3)). If complete linkage, the distance becomes d(&),€4) = 
max(d(§1,§2),d(&1,§3)). 

Our merging algorithm is a particular kind of linkage clustering where the elements to be clus- 
tered are cliques and the distance between the cliques is the separability. The update of this distance 
for merged groups of cliques is however different from commonly used versions in linkage clus- 
tering. The update is the same as single linkage under a certain scenario, but differs in general 
because of the directed links and the notion of head cliques. We may call this linkage clustering 
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algorithm directional single linkage for reasons that will be self-evident shortly. Consider the above 
example where €) and €3 merge into €4. We call & more dominant than € if the head clique in € 
has a higher significance index than that in &’, and denote 5(€) > 6(&’). Without loss of generality, 
assume 6(€)) > 6(€3). Then the update of d(&1,€4) and the ordering of 5(€;) and 6(&4) (needed for 
updating the distance) follows three cases: 


d(§1,§4) = 4(€1,§2), and 8(§1) > 8(&4), if 8(§1) > 8(&2) 
d(§1,&4) = d (61,62), and 8(§1) < 8(&4), if 8(§3) < 8(E1) < 8(&2) 
d(§1,54) = min(d(§1,§2),d(§1,§3)), and (61) < (G4), if (G1) < (Gs) . 


In our proposed merging procedure, we essentially employ a threshold to stop merging when all the 
between-cluster distances exceed this value. An alternative is to apply the directional single linkage 
clustering and stop merging when a desired number of clusters is achieved. 

We use the glass data set in the previous section to illustrate the merging algorithm. The thresh- 
old for separability is set to € = 0.5 and the coverage rate is p = 95%. Apply the algorithm to the 11 
clusters formed at the 2nd level of the hierarchical clustering, shown in Figure 1(a). We refer to the 
clusters as z1, ..., Z11, where the label assignment follows the indication in the figure. The 11 clus- 
ters form 9 cliques. Figure 2(a) shows the clustering after merging clusters in the same clique. The 
square (triangle) symbol used for z2 (z7) is now used for both z2 (z7) and z3 (zg) to indicate that they 
have been merged. To highlight the relationship between the merged clusters and the original clus- 
ters, the list of updated symbols for each original cluster is given in every scatter plot. Figure 2(b) 
demonstrates the directed graph constructed for the cliques. The z;’s contained in each clique are 
indicated in the node. Figure 2(c) shows the clustering result after stage I merging. The symbol of 
the head clique in each connected subgraph is adopted for all the clusters it absorbs. The 4 clusters 
generated at stage I contain 73, 25, 6, 1 points respectively. At p = 95%, only the cluster of size 1 
is marked as an outlier cluster, and is merged with the cluster of size 6. 

Because clusters with low separability are apt to be grouped together when the kernel band- 
width increases, it is not surprising for the clustering result obtained by the merging algorithm to 
agree with clustering at a higher level of HMAC. On the other hand, examining separability and 
identifying outlier clusters enhance the robustness of clustering results, a valuable trait especially 
in high dimensions. Examples will be shown in Section 6. For practical interest, when equipped 
with the merging algorithm, we do not need to generate all the hierarchical levels in HMAC until 
reaching a targeted number of clusters. Instead, we can apply the merging algorithm to a relatively 
large number of clusters obtained at an early level and reduce the number of clusters to the desired 
value. 


5. Visualization 


For clarity, we have used 2-D data to illustrate our new clustering methods, although these methods 
are not limited by data dimensions. Projection into lower dimensions is needed to visualize clus- 
tering results for higher dimensional data. PCA (principal component analysis) is a widely used 
linear projection method, but it is not designed to reveal clustering structures. We will describe in 
this section a linear projection method that aims at effectively showing clustering structures. The 
method is employed in our experiments. We note that visualization is a rich research area in its own 
right. However, because this topic is beyond the focus of the current paper, we will not discuss it in 
great depth, nor make thorough comparisons with other methods. 


1703 


LI, RAY AND LINDSAY 


Modal clustering provides us an estimated density function and a prior probability for each clus- 
ter. Suppose K clusters are generated. Let the cluster density function of x, x € RI, be g(x), and 
the prior probability be Tg, k = 1,2,...,K. For any x € RI, its extent of association with each cluster 
k is indicated by the posterior probability p(x) œ mTggg(x). To determine the posterior probabili- 
ties p(x), under a given set of priors, it suffices to specify the discriminant functions log ao, ny, 
e0, Without loss of generality, we use gg (x) as the basis for computing the ratios. Our pro- 


jection method attempts to find a plane such that log = a ,k=1,...,K — 1 can be well approximated 

if only the projection of data into the plane is specified. By preserving the discriminant functions, 
the posterior probabilities of clusters will remain accurate. 

Let the data set be {x1,x2,...,%n}, Xi € Rİ, Denote a particular dimension of the data set by 

Xi 

zt 

. Linear regression is performed based on the pairs (x;, Yik), i = 


log 








X= (X11,X21, ---;Xn1), L= 1,...,d. For each k, k = 1,...,K — 1, the pairs (x;, log 


8i(%i) 
8K (xi) 
1,...,2, to acquire a linear approximation for each discriminant function. Let B.o, Bx,1,Bx2,---; Bea 
be the regression coefficients for the kth discriminant function. Denote By = (Bx,1, Bk,2, ---, Bea)’ and 
the fitted values for log glz) by fix = Beo + Bixi. Also denote ¥;% = Bixi = fix — Bk,o. For mathe- 
matical tractability, we convert the approximation of the discriminant functions to the approximation 
of their linearly regressed values (9; 1,9; 2, -i k-1), i= 1,...,n, which is equivalent to approximate 
(¥i,1,5i,2;-+-,9i,K—1) Since the two only differ by a constant. To precisely specify (9:1, 9i2, -Ji K-1), 
we need the projection of x; onto the K — 1 directions, B,, Bo, ..., Bx—1. If we are restricted to show- 
ing the data in a plane and K — 1 > 2, further projection of (¥;,1,5;.2,...,5i,x—1) is needed. At this 
stage, we employ PCA on the vectors (¥,1,5j,2, ---,9i,x—1) (referred to as the discriminant vectors), 
i= 1,...,n, to yield a two-dimensional projection. Suppose the two principal component directions 
for the discriminant vectors are yj = (Yj,1,---;Y;,K-1)', J= 1,2. The two principal components vj, 


j= 1,2, are 





jis Pere 





are computed. Let y; = log 


Vij fi, fi k-1 
v2 j 21 Î2K—1 d |K- 
=i) te typi] | =} |} ¥eBen lea 
: : [=1 | k=1 
Vn, j In Ýn, K-1 


To summarize, the two projection directions for x; are 


K-1 K-1 RA t 
t= | X Yaben X Wabe $ YkBea) > j=1,2. (5) 
k=l k=l k=l 


In practice, it may be unnecessary to preserve all the K — 1 discriminant functions. We can apply 
the above method to a subset of discriminant functions corresponding to major clusters. The two 
projection directions in (5) are not guaranteed to be orthogonal, but it is easy to find two orthonormal 
directions spanning the same plane. 

If we use a basis function other than gx(x), say gx (x), to form the discriminant functions, the 
new set of vectors B,’s will span the same linear space as the Bg’s obtained with gx(x). The reason is 


that the new discriminant functions log a a ,1<k<K,k#k', can be linearly transformed from the 
previously defined (y;,1,Y;,2, ---, Y; K—1) by log = = yik — Yik, for k # k',K, log si = —y; x, and 
linear regression is used on the y;;’s. On the other hand, as the linear transform is not orthonormal, 


the PCA result is not invariant under the transform and the projection directions T; can change. 
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6. Experiments 


We present in this section experimental results of the proposed modal clustering methods on simu- 
lated and real data. We also discuss measures for enhancing computational efficiency and describe 
the application to image segmentation, which may involve clustering millions of data points. 


6.1 Simulated Data 


We experiment with three simulated examples to illustrate the effectiveness of modal clustering. We 
start by comparing linkage clustering with mixture modeling using two data sets. This will allow 
us to illustrate the strengths and weaknesses of these two methods and therefore better demonstrate 
the tendency of modal clustering to combine their strengths. We then present a study to assess the 
stability of HMAC over multiple implementations and its performance under increasing dimensions. 
In this study, comparisons are made with the Mclust function in R, a state-of-the-art mixture-model- 
based clustering tool (Fraley and Raftery, 2002, 2006). 

For our two data sets, single linkage, complete linkage, and average linkage yield similar re- 
sults. For brevity, we only show results of average linkage. In average linkage, if cluster z2 and 
z3 are merged into z4, the distance between zı and z4 is calculated as d(z1,z4) = ae (21,22) + 
mona (z1,z3), where nz and n3 are the cluster sizes of z2 and z3 respectively. Details on clustering 
by mixture modeling are referred to Section 1.1. We will also show results of k-means clustering. 

The first data set, referred to as the noisy curve data set, contains a half circle and a straight line 
(or bar) imposed with noise, as shown in Figure 3(a). The circle centers at the origin and has radius 
7. The line is a vertical segment between (13,—8) and (13,0). Roughly 4 of the 200 points are 
uniformly sampled from the half circle and 1 of them uniformly from the bar. Then, independent 
Gaussian noise with standard deviation 0.5 is added to both the horizontal and vertical directions of 
each point. 

Consider clustering into two groups. The results of average linkage, mixture modeling, and 
k-means are shown in Figure 3(a), (b), (c). For this example, average linkage partitions the data 
perfectly into a noisy half circle and bar. Results of mixture modeling and k-means are close. In 
both cases, nearly one side of the half circle is grouped with the bar. In this example, the mixture 
model is initialized by the clustering obtained from k-means; and the covariance matrices of the two 
clusters are not restricted. 

The second data set contains 200 samples generated as follows. The data are sampled from two 
clusters with prior probability 1/3 and 2/3 respectively. The first cluster follows a single Gaussian 
distribution with mean (6,0) and covariance matrix diag(1.5*, 1.5”). The second cluster is generated 
by a mixture of two Gaussian components with prior probability 1/5 and 4/5, means (—3,0) and 
(0,0), and covariance matrices diag(3?,3?) and diag(1,1) respectively. The two clusters are shown 
in Figure 4(a). Again, we compare results of average linkage, mixture modeling, and k-means, 
shown in Figure 4(b), (c), (d). For mixture modeling, we use Mclust with three components and 
optimally selected covariance structures by BIC. Two of the three clusters generated by Mclust are 
combined to show the binary grouping. For this example, mixture modeling and k-means yield a 
partition close to the two original clusters, while average linkage gives highly skewed clusters, one 
of which contains a very small number of points on the outskirt of the mass of data. 

We apply HMAC to both data sets and show the clustering results obtained at the level where 
two clusters are formed. For the noisy curve data set, HMAC perfectly separates the noisy half circle 
and the bar, as shown in Figure 3(a). For the second example, shown in Figure 4(e), HMAC yields 
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Figure 3: Clustering results for the noisy curve data set. (a) The two original clusters. The two 
clusters obtained by average linkage clustering and HMAC are identical to the original 
ones. (b) Clustering by mixture modeling with two Gaussian components. (c) Clustering 
by k-means. (d) Clustering by HMAC at the first level of the hierarchy. 


clusters closest to the original ones among all the methods. Comparing with mixture modeling, 
HMAC is more robust to the non-Gaussian cluster distributions. 

The above two data sets exemplify situations in which either the average linkage clustering or 
mixture modeling (or k-means) performs well but not both. In the first data set, the two clusters are 
well separated but seriously violate the assumption of Gaussian distributions. By greedy pairwise 
merging, average linkage successfully divides the two clusters. In contrast, both mixture modeling 
and k-means attempt to optimize an overall clustering criterion. Mixture modeling favors elliptical 
clusters because of the Gaussian assumption, and k-means favors spherical clusters due to extra 
restrictions on Gaussian components. As a result, one side of the noisy half circle is grouped with 
the bar to achieve better fitting of Gaussian distributions. On the other hand, mixture modeling 
and k-means perform significantly better than average linkage in the second example. The greedy 
pairwise merging in average linkage becomes rather arbitrary when clusters are not well separated. 

HMAC demonstrates a blend of traits from average linkage and mixture modeling. When the 
kernel bandwidth is small, the cluster to which a point is assigned is largely affected by its neighbors. 
Points close to each other tend to be grouped together, as shown by Figure 3(d) and Figure 4(f). 
This strong inclination of putting neighboring points in the same cluster is also a feature of average 
linkage. However, a difference between HMAC and average linkage is that decisions to merge in 
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Figure 4: Clustering results for the second simulated data set. (a) The original two clusters. (b) 
Clustering by average linkage. (c) Clustering by mixture modeling using Mclust with 
three Gaussian components. Two clusters are merged to show the binary grouping. (d) 
Clustering by k-means. (e) Clustering by HMAC. (f) Clustering by HMAC at the first 
level of the hierarchy. 
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the latter are always local. While in HMAC, when the bandwidth increases, global characteristics 
of the data become highly influential on clustering results. Hence the clustering result tends to 
resemble that of mixture modeling (k-means) which enforces a certain kind of global optimality. In 
spite of this similarity, HMAC is more robust for clusters deviating from Gaussian distributions. In 
practice, it is usually preferable to ensure very close points are grouped together and in the mean 
time to generate clusters with global optimality. These two traits, however, are often at odds with 
each other, a phenomenon discussed in depth by Chipman and Tibshirani (2006). 

Chipman and Tibshirani (2006) noted that bottom-up agglomerative clustering methods, such 
as average linkage, tend to generate good small clusters but suffer at extracting a few large clusters. 
The strengths and weaknesses of top-down clustering methods, such as k-means, are the opposite. 
A hybrid approach is proposed in that paper to combine the advantages of bottom-up and top- 
down clustering, which first seeks mutual clusters by bottom-up linkage clustering and then applies 
top-down clustering to the mutual clusters. HMAC also integrates the strengths of both types of 
clustering, in a way not as explicit as the hybrid method but more automatically. 

To systematically study the performance of HMAC for high dimensional data and its stability 
over multiple implementations, we conduct the following experiment. We generate 55 random data 
sets each of dimension 50 and size 200. The first two dimensions of the data follow the distribution 
of the noisy curve data in the first example described. The other 48 dimensions are independent 
Gaussian noise all following the normal distribution with mean zero and standard deviation 0.5, 
same as the noise added to the half circle and line segment in the first two dimensions. As gold 
standard, we regard data generated by adding noise to the half circle as one cluster and those to the 
line segment as the other. 

Clustering results are obtained for each of the 55 data sets using HMAC and Mclust respectively. 
For HMAC, the level of the dendrogram yielding two clusters is chosen to create the partition of the 
data. In another word, the basic version of HMAC is used without the separability based merging of 
clusters. For Mclust, we set the number of clusters to 2, but allow the algorithm to optimally select 
the structure of the covariance matrices using BIC. All the structural variations of the covariance 
matrices provided by Mclust are searched over. In Mclust, the mixture model is initialized using the 
suggested default option, that is, to initialize the partition of data by an agglomerative hierarchical 
clustering approach, an extension of linkage clustering based on Gaussian mixtures (Fraley and 
Raftery, 2006). This initialization may be of advantage especially to data in this study because as 
shown previously, linkage clustering generates perfect clustering for the noisy curve data set in the 
first example. 

Denote each data set k, k = 1,2,...,55, by Ay = (x i = 1,...,200}, where x (x64) w 


= 1o Xi2o 0o 
k : ; ; kl k) (k k 
( ) we forma sequence of its lower dimensional vectors: x! #) (x! i „xí A ; „ax l Ji, 


l = 2,3,...,50. Let Akı = piP i = 1,...,200}. HMAC and Mclust are applied to every Ax, 


l = 2,...,50, k = 1,...,55. Let the clustering error rate obtained by HMAC for Ag, be ri) and 


that by Mclust be r0, We summarize the clustering results in Figure 5. 


To assess the variation of clustering performance over multiple implementations, we compute 
the percentage of the 55 data sets that are not perfectly clustered by the two methods at each dimen- 
sion / = 2,...,50. Figure 5(a) shows the result. For HMAC, this percentage stays between 25% and 
32% over all the dimensions. While, for Mclust, the percentage is consistently and substantially 
higher, and varies greatly across the dimensions. For / > 43, none of the data sets can be perfectly 
clustered by Mclust. 


Mee . For each x 
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Figure 5: Clustering results obtained by HMAC and Mclust for the 55 high dimensional noisy curve 
data sets. (a) The percentage of data sets that are not perfectly clustered with increasing 
dimensions. (b) The average and median of clustering error rates. 


To examine clustering accuracy with respect to increasing dimensions, for each / = 2, ..., 50, the 


mean pu) a, and the median a an over the data sets are computed and shown in Figure 5(b). 


g 


For HMAC, the mean r% varies only slightly around 7.5% when the dimension increases, and 


the median pn stays at zero. The error rates obtained from Mclust, on the other hand, change 
dramatically with the dimension. And for / = 2 and / > 30, both 7M) and a are significantly 


higher than r% Interestingly, the worst error rate from Mclust is achieved at / = 2 with 7M) > 23% 


rather than at the high end of the range of dimensions. This seems to suggest that the extra noise 
dimensions help to mollify the effect of non-Gaussian shaped clusters. When / > 30, 7M) and ae 


increase steadily with the dimension and eventually reaches nearly 20%. ' 


Because it is expected that Gaussian components in the mixture model cannot well capture the 
half circle structure in the data, we have also tested Mclust with 3 components so that the half circle 
can possibly be characterized by two components. In this case, two of the three clusters need to be 
combined in order to compute the accuracy with respect to the original two classes. The selection 
of the two clusters to combine is not trivial. If we use a simple rule of combining the two clusters 
with the closest pair of centers, the average classification accuracies (over different dimensions) 
are mostly inferior to those by Mclust with 2 components. Note that the ridgeline based merging 
procedure in Section 4 may be invoked instead. A detailed examination in this direction is out of the 
scope of this paper. If we search through all the possible combinations and always choose the one 
that yields the best classification accuracy, the average accuracies are better than those achieved by 
HMAC. However, this comparison inherently favors Mclust because the true class labels are used 
to decide the binary grouping of the 3 components, while HMAC is purely unsupervised. 
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6.2 Real Data Sets 


In this section, we apply HMAC to real data sets, compare our method of visualization described 
in Section 5 with PCA, and demonstrate the usage of the cluster merging algorithm described in 
Section 4.2. 


6.2.1 GLASS DATA 


We first examine the aforementioned glass data set using the entire set of 214 samples and all 
the original 9 features. Applying HMAC, a dendrogram of 10 levels is obtained, as shown in 
Figure 6(a). The number of clusters at each level is listed in Table 2. The sizes of the largest 4 
clusters at each level are also given in this table. The dendrogram and the table show that 3 clusters 
containing more than 5 points are formed at the first level. These 3 prominent clusters are retained 
or augmented up to level 3. At level 4, two of the 3 prominent clusters are merged, leaving 2 
prominent clusters which are further merged at level 7. At this level, both the dendrogram and 
the table suggest the main clustering structure in the data is annihilated by the very large kernel 
bandwidth. However, the number of clusters generated at level 7 is 7 instead of 1. Except the largest 
cluster, the other 5 clusters each contain no more than 3 points and in total only 9. They may be 
considered more appropriately as “outliers” than clusters. At even higher levels, these tiny clusters 
are merged gradually into the main mass of data. 




















Level 1 2 3 4 5 6 7 8 9 10 
# clusters 29 | 25 18 15 13 11 7 6 4 3 
Size of 1st cluster | 160 | 163 | 176 | 177 | 179 | 180 | 205 | 208 | 210 211 
Size of 2nd cluster | 12 12 12 | 21 21 22 3 2 2 2 
Size of 3rd cluster 9 9 9 3 3 3 2 1 1 1 
Size of 4th cluster 5 6 2 2 2 2 1 1 1 not exist 









































Table 2: The clustering results for the full glass data set. 


The above discussion suggests that to obtain a given number of clusters, it is not always a good 
practice to choose a level in the hierarchy that yields the desired number of clusters. We may select a 
level at which major clusters are merged and outliers are mistaken as plausible clusters. One remedy 
is to apply the cluster merging algorithm to a larger number of clusters formed at a lower level. We 
will present results of this approach shortly. 

To compare our visualization method with PCA, the clustering results at level 3 are shown in 
Figure 6(b) and (c) using the two projection methods respectively. Both projections are orthonormal. 
In our visualization method, the projection only attempts to approximate the discriminant functions 
between the 3 major clusters. It is obvious that the new visualization method shows the clustering 
structure better than PCA. The two projection directions derived by our method are used when 
presenting other clustering results for a clear correspondence between points in different plots. 

If we apply our cluster merging algorithm at level 3, we obtain results shown in Figure 6(d) and 
(e). In Figure 6(d), three clusters are formed by applying stage II merging based on coverage rate. 
The parameter p = 95%. Merging based on separability is not conducted because we attempt to get 
3 clusters and the 3 largest clusters at this level already account for close to 95% of the data. This 
clustering result is much more preferable than the 3 clusters directly generated by HMAC at level 
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Figure 6: Results for the full glass data set. (a) The dendrogram created by HMAC. (b) Visual- 
ization by our method using regression on discriminant functions. (c) Visualization by 
PCA. (d), (e) Three (two) clusters obtained by applying the merging algorithm to clusters 
generated by HMAC at level 3. 
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Figure 7: The dendrogram generated by HMAC for the infant data set. 





10, containing 211, 2, and 1 points respectively. If we apply merging based on separability with 
threshold € = 0.4 (stage I) and merging based on coverage rate with p = 95% (stage II), we obtain 
two clusters shown in Figure 6(e). 


6.2.2 INFANT ATTENTION TIME 


The second real data set is provided by Hoben Thomas, funded by the German Research Foundation 
Grant (Az. Lo 337/19-2), in the Department of Psychology at Penn State. In a study of infants’ 
behavior, 51 infants were tested in two occasions apart by several months. In each occasion, a visual 
stimulus was given to the infants repeatedly for 11 times, with a fixed amount of time separating the 
stimuli. An infant’s attention time in every stimulus was recorded. We thus have a data set of 51 
samples with dimension 22. It is of interest to examine whether the infant data possess clustering 
structure and the behavior patterns of different groups. It is challenging to cluster this data set 
because of the high dimensionality and relatively small sample size. 


Applying HMAC to the data, we obtain a dendrogram shown in Figure 7. At level 2, two promi- 
nent clusters emerge, containing 8 and 27 samples respectively. All the other clusters are singletons. 
At the next higher level, the two main clusters are merged. Because all the other clusters are sin- 
gletons, we essentially partition the data into a main group and several outliers. There is no clear 
clustering structure preserved after level 2. Figure 8(a) and (b) show the clustering results obtained 
at level 2 using projection directions derived by our visualization method and PCA respectively. 
Specifically, in our method, the density of the largest cluster is used as the basis to form the dis- 
criminant functions of the other clusters. The projection directions are derived to best preserve the 
discriminant functions of the second and third largest clusters. The separation of the 2 main clusters 
is reflected better in (a) than (b). The ridgeline between the two major clusters at level 2 is com- 
puted, and the density function along this ridgeline is plotted in Figure 8(c). The plot shows that the 
“valley” between the two clusters is not deep comparing with the peak of the less prominent cluster, 
indicating weak separation between the clusters. 


As the dendrogram suggests, if we need to cluster the infants into two groups, specifically, to 
assign the singletons into the two main clusters, we cannot simply cut the dendrogram at a level that 
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Figure 8: Results for the infant data set. (a) Visualization by our method using regression on dis- 
criminant functions. (b) Visualization by PCA. (c) Density function along the ridgeline 
between the two major clusters at level 2. (d) Two clusters obtained by applying the 
merging algorithm to clusters generated by HMAC. (e) The modal curves of the two 
main clusters obtained by HMAC at level 2. Dashed line: cluster 1. Solid line: cluster 
2. Dash-dot line: the observation on the 10th infant, who deviates enormously from the 
cluster modes. (f) The mean curves of the two clusters obtained by fitting a mixture of 2 
Gaussian components. 
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yields two clusters. For this purpose, we use the stage II merging procedure with a coverage rate of 
85%. The clustering result is shown in Figure 8(d). 


To compare the infants in the two main clusters identified by HMAC, we plot the “modal curves” 
for both test occasions in Figure 8(e). By modal curve, we refer to the mode of a cluster displayed 
with the dimension index as the horizontal axis and the value in that dimension as the vertical axis. 
It is meaningful to view a mode as a curve in this study because the experiments were conducted 
sequentially and dimension i corresponds to the ith measurement in the sequence. According to 
Figure 8(e), the main difference between the two groups of infants is their attention time for the 
first stimulus in each test occasion. The circle cluster exhibits a significantly longer attention time 
for the initial stimulus in both occasions. According to HMAC, 8 infants belong to the long initial 
attention time group and 27 infants belong to the short time group. The other 16 infants’ data are 
distinct from both groups. In Figure 8(e), the attention time of the 10th infant is shown by the dash- 
dot lines. This infant is the most “extreme” outlier which corresponds to the right most branch in 
the dendrogram of Figure 8(c). The plot shows that this infant behaved very differently from the 
average in the first test. Instead of gradually decreasing, his attention time jumped to a very high 
value for the third stimulus and again for the last stimulus. If a clear-cut two groups are desired, 
Figure 8(d) shows that 13 infants are partitioned into the long time group and 38 into the short time 
group. 

We also perform clustering on this data set by fitting a mixture of two Gaussian components. The 
EM algorithm is used to estimate the mixture model using k-means clustering to initialize, and the 
maximum a posterior criterion is used to partition the data. For the two clusters obtained, we display 
the two Gaussian mean vectors as curves in Figure 8(f). Just as in the clustering by HMAC, one 
cluster had longer attention time to the initial stimulus in both occasions. However, the difference 
between the attention time in the first occasion for the two clusters was not substantial. Using 
mixture modeling, the long time group included 26 infants while the short time group included 25. 


6.2.3 NEWSGROUP DATA 


In the previous two examples, we cannot quantitatively examine the clustering performance be- 
cause there are no given class labels for the data points. In the third example, we use a data 
set containing documents labeled by different topics. This data set is taken from the newsgroup 
data (Lang, 1995). Each instance corresponds to a document in one of the two computer related 
topics: comp.os.ms-windows.misc (class 1) and comp.windows.x (class 2). The numbers of in- 
stances in the two classes are close: 975 (class 1) and 997 (class 2). The raw data for each document 
simply contain the words appeared in this document and their counts of occurrences. We process the 
data by stemming the words and employing two stages of dimension reduction. For details on the 
dimension reduction procedure, which relies mainly on two-way mixture modeling, we refer to (Li 
and Zha, 2006). In summary, we represent every document by a 10-dimensional vector, where each 
dimension is the total number of occurrences for a chosen collection of words. We then normalize 
the vector such that each dimension becomes the frequency of the corresponding set of words. 


Due to the large data size, it is unrealistic to show the dendrogram generated by HMAC directly. 
Instead, we provide a summarized version of the dendrogram emphasizing prominent clusters in 
Figure 9(a). In order to retain as much information as possible in the summary, a cluster will 
be regarded “prominent” and shown individually if it contains at least 3% of the total data. This 
requirement for showing a cluster is rather mild. The number in each node box in the tree indicates 
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Figure 9: Results for the document data set. (a) The summarized dendrogram showing the promi- 
nent clusters. (b) The density function along the ridgeline between the two major clusters 
at level 12. 


the cluster size. The number of data points not covered by the prominent clusters at any level is 
shown in the box to the right of the dendrogram. As shown in the figure, prominent clusters do 
not appear until level 6. At level 6, the four prominent clusters are very small. About 77% of the 
data are scattered in tiny clusters. At level 8, two major clusters emerge, each accounting for nearly 
39% of the data. The rest data do not form any prominent clusters. The two major clusters remain 
through level 8 to level 18, and each absorbs more data points at every increased level. For brevity, 
we omit showing the sizes of these two clusters between level 9 and 18. At level 19, the two major 
clusters are merged, and there are 19 outlier points not absorbed into the main mass of data. This 
dendrogram strongly suggests there are two major clusters in this data set because before level 8, the 
clusters formed are too small and the percentage of data not covered by the clusters is too high. If 
we allow 5% points to lie outside prominent clusters, we can choose level 12 in the dendrogram. At 
level 12, the two clusters are of size 950 and 932. To examine the separation of the two clusters at 
this level, we calculate the ridgeline between them and plot the density function along the ridgeline 
in Figure 9(b). The mode heights of the two clusters are close, and the cluster separation is strong. 





Level 8 9 10 11 12 13 14 15 16 17 18 
Correct (%) 73.5 | 80.4 | 85.4 | 87.7 | 89.0 | 90.2 | 91.0 | 91.2 | 91.5 | 91.6 | 91.7 
Incorrect (%) 3.9 | 49 | 5.7 | 61 | 64 |65 | 68 | 68 | 69 |71 | 7.2 
Uncovered (%) | 22.6 | 14.8 | 8.9 | 6.2 | 46 | 3.2 | 2.2 | 2.0 | 16 | 1.2 | 1.0 





















































Table 3: The clustering accuracy of HMAC for the document data set. The percentages of points 
that are correctly clustered, incorrectly clustered, and not covered by the two major clusters 
are listed. 
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From level 8 to 18, we compute the percentages of points that are correctly clustered, incorrectly 
clustered, and not covered by the two major clusters. Table 3 provides the result. At level 18, 91.7% 
data points are correctly clustered. For comparison, we apply Mclust to the same data set. If we 
specify the range for the number of clusters as 2 to 10 and let Mclust choose the best number and the 
best covariance structure using BIC, the number of clusters chosen is 3. The sizes of the 3 clusters 
are 763, 668, and 541. The first two clusters are highly pure in the sense of containing points from 
the same class. The third cluster however contains about 40% class 1 points and 60% class 2. If we 
label the third cluster as class 2, the overall clustering accuracy is 87.6%. On the other hand, if we 
fix the number of clusters at 2 and run Mclust, the clustering accuracy becomes 94%. 


6.2.4 DISCUSSION ON PARAMETER SELECTION 


As shown by the above examples of real data clustering, it is often not obvious which level in the 
dendrogram produced by HMAC should be used to yield the final clustering. This is a general issue 
faced by agglomerative clustering approaches. Prior information about data or our implicit assump- 
tions about good clusters can play an important role in this decision. One common assumption we 
make is that a valid cluster ought not be too small. Under this principle, we declare a level of the 
dendrogram too low (small bandwidth) if all the clusters are small and a level too high (large band- 
width) if a very large portion of the data (e.g., 95%) belong to a single cluster. For the intermediate 
acceptable levels in the dendrogram, we have designed the coverage rate mechanism to ensure that 
very small clusters are excluded. The chosen coverage rate reflects the amount of outliers one be- 
lieves to exist. Despite the inevitable subjectiveness of this value, the coverage rate affects only the 
grouping of a small percentage of outlier points. 


A more profound effect on the clustering result comes from the merging procedure based on sep- 
arability which involves choosing a separability threshold. As we have discussed in Section 4.2, the 
merging process based on separability is essentially the directional single linkage clustering which 
stops when all the between-cluster separability measures are above the threshold. The threshold 
directly determines the final number of clusters, but is irrelevant to the full clustering hierarchy gen- 
erated by the directional single linkage. Hence in situations where we have a targeted number of 
clusters to create, we can avoid choosing the threshold and simply stop the directional single linkage 
when the given number is reached. Of course, if at a certain level of the dendrogram, there exist 
precisely the targeted number of valid clusters, we can use that level directly rather than applying 
merging on clusters at a lower level. Whether the HMAC dendrogram clearly suggests the right 
number of clusters can be highly data dependent. For instance, in the document clustering example, 
the dendrogram in Figure 9(a) strongly suggests two clusters because at all the acceptable levels 
there are two reasonably large clusters. On the other hand, for the glass data set with results shown 
in Figure 6, it is not so clear-cut whether there are three or two clusters. 


Another choice we need to make in HMAC is the sequence of kernel bandwidths, 6; < 62 < 
++ < On. It is found empirically that as long as the grid of bandwidths is sufficiently fine, the 
prominent clusters created are not sensitive to the exact sequence. Major clustering structures often 
remain over a wide range of bandwidths. We have always used a uniformly spaced sequence of 
bandwidths in our experiments. If precisely two clusters merge at every increased level of a dendro- 
gram, that is, the number of clusters decreases exactly by one, the dendrogram will have n levels, 
where n is the data size. We call such a dendrogram all-size since the clustering into any number of 
groups smaller than n appears at a certain level. We rarely observe an all-size dendrogram generated 
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by HMAC. On the other hand, by using extremely refined bandwidths, we indeed obtained all-size 
dendrograms for the example in Section 3.2 and the infant data set. 


We note that linkage clustering by construction generates the all-size dendrogram. However, it 
is not necessarily a good practice to simply choose a level in the dendrogram that yields the desired 
number of clusters, as we have discussed previously for the dendrogram of HMAC. Hence, the fact 
that the dendrogram generated by HMAC is usually not all-size raises little concern. In practice, 
since the targeted number of clusters is normally much smaller than the data size, it is easy to find 
a relatively low level at which the number of clusters exceeds the target. We can then apply the 
separability based directional single linkage clustering at that level, and achieve any number of 
clusters smaller than the starting value. 


6.3 Image Segmentation 


To demonstrate the applicability of HMAC to computationally intensive tasks, we develop an im- 
age segmentation algorithm based on HMAC. A basic approach to image segmentation is to cluster 
the pixel color components and label pixels in the same cluster as one region (Li and Gray, 2000). 
This approach partitions images into regions that are relatively homogeneous. Examples of seg- 
mentation via clustering are shown in Figure 10. We employed the speeding up method described 
in Section 3.3. Our image segmentation method comprises the following steps: (a) Apply k-center 
algorithm to cluster image pixels into a given number of groups. This number is significantly larger 
than the desired number of regions. In particular, we set it to 100. (b) Form a data set {x1,...,xn}, 
n = 100, where x; is the mean of the vectors assigned to the ith group by k-center clustering. For 
each x;, assign weight w;, where w; is the percentage of pixels assigned to x;. (c) Apply the weighted 
version of HMAC to the data set. The only difference lies in the formula for the kernel density esti- 
mator. In the weighted version, f(x) = £; wio(x | x;,D(07)). (d) Starting from the first level of the 
dendrogram formed by HMAC, apply the cluster merging algorithm described in Section 4.2. If the 
number of clusters after merging is smaller than or equal to the given targeted number of segments, 
stop and output the clustering results at this level. Otherwise, repeat the merging process at the 
next higher level of the dendrogram. For brevity, we simply refer to this segmentation algorithm as 
HMAC. 


To assess the computational efficiency of HMAC, we experiment with 100 digital photo images 
randomly selected from the Corel image database (Wang et al., 2001). Every image is of size 
256 x 384 or 384 x 256. The experiments were conducted on a 1.2GHz Sun UltraSparc processor. 
We compare the segmentation time of HMAC and k-means. On average, it takes 4.41 seconds 
to segment an image using HMAC. The average number of segmented regions is 6.0. For k-means 
clustering, we experimented with both dynamically determining and fixing the number of segmented 
regions. In the dynamic case, the average number of regions generated per image by thresholding 
is 5.5. The average segmentation time for each image is 4.43 seconds, roughly equal to the time 
of HMAC. However, the computation time of k-means increases if more regions are formed for an 
image. If we fix the number of segmented regions to 6.0, the average segmentation time is 4.87 
seconds per image. 

In terms of segmentation results, whether HMAC or k-means is preferred is application depen- 
dent. K-means clusters the pixels and computes the centroid vector for each cluster according to 
the criterion of minimizing the mean squared distance between the original vectors and the centroid 
vectors. HMAC, however, finds the modal vectors, at which the kernel density estimator achieves 
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Figure 10: Segmentation results. First row: Original images. Second row: mode colors of the 
clusters generated by HMAC. Third row: mean colors of the clusters generated by k- 
means. 


a local maxima. These vectors are peaks of density bumps. They are significant in the sense of 
possessing locally maximum density, but may not be the best approximation to the original vectors 
in an average sense. Figure 10 shows the representative colors extracted by HMAC and k-means 
for several impressionism paintings. For HMAC, the mode color vector of each cluster is shown as 
a color bar; and for K-means, the mean vector of each cluster is shown. The representative colors 
generated by k-means tend to be “muddier’” due to averaging. Those by HMAC retain the true col- 
ors better, for instance, the white color of the stars in the first picture and the purplish pink of the 
vase in the second. On the other hand, HMAC may ignore certain colors that either are not distinct 
enough from others or do not contain enough pixels. For the purpose of finding the main palette of 
a painting, HMAC may be more preferable. 


7. Conclusion 


In this paper, we have introduced an EM-style algorithm, namely, Modal EM (MEM), for finding 
local maxima of mixture densities. For a given data set, we model the density of the data non- 
parametrically using kernel functions. Clustering is performed by associating each point to a mode 
identified by MEM with initialization at this point. A hierarchical clustering algorithm, HMAC, is 
developed by gradually increasing the bandwidth of the kernel functions and by recursively treating 
modes acquired at a smaller bandwidth as points to be clustered when a larger bandwidth is used. 
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The Ridgeline EM (REM) algorithm is developed to find the ridgeline between the density 
bumps of two clusters. A separability measure between two clusters is defined based on the ridge- 
line, which takes comprehensive consideration of the exact densities of the clusters. A cluster merg- 
ing method based on pairwise separability is developed, which addresses the competing factors of 
using a small bandwidth to retain major clustering structures and using a large one to achieve a low 
number of clusters. 


The HMAC clustering algorithm and its combination with the cluster merging algorithm are 
tested using both simulated and real data sets. Experiments show that our algorithm tends to unite 
merits of linkage clustering and mixture-model-based clustering. Applications to both simulated 
and real data also show that the algorithm works robustly with high dimensional data or clusters de- 
viating substantially from Gaussian distributions. Both of these cases pose difficulty for parametric 
mixture modeling. 


A C package at http://www. stat.psu.edu/~jiali/hmac is developed, which includes the 
implementation of the HMAC algorithm, REM for computing ridgelines, and the separability/ 
coverage rate based merging algorithm. 


There are several directions that can be pursued in the future to strengthen the framework of 
modal clustering. In the current work, we use a fixed bandwidth for the kernel density functions. 
We can explore ways to make the bandwidth vary with the location of the data because there may not 
exist a single bandwidth suitable for the entire data set. Moreover, in this paper, we have discussed 
an approach to best visualize clusters in lower dimensions. A related and interesting question is to 
find a lower dimensional subspace in which the data form well separated modal clusters. We expect 
that the optimization of the subspace needs to be conducted as an integrated part of the modal 
clustering procedure. 
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Appendix A. 


We prove the ascending property of the MEM algorithm. Let the mixture density be f(x) = 
Li mfi(x). Denote the value of x at the rth iteration of MEM by x, We need to show 
f(x) > f(x), or equivalently, log f(x") > log f(x). 


Let us introduce the latent discrete random variable J € {1,2,...,K} with prior probabilities 
P(J =k) = Tg. Assume that the conditional density of X given J =k is f(x). Then the marginal 
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distribution of X is f(x), as specified above. Define the following functions: 
L(x) = F F(x); 
Q(x’ |x) = È Px(x) log Tr fx(2’), 


px(x) log p(x’) 


Me» 


H(x |x) = Q(x |x)-Lx) = 


k=1 








where p(x) = P(J =k | X =x) Sate is the posterior probability of J being k given x. Denote 
the posterior probability mass function (pmf) given x by p(x) = (p1(x), p2(x),...,pK(x)). 

Because H (x | x) — H(x’ | x) = D(p(x) || p(2’)) and relative entropy D(- || -) is always nonneg- 
ative (Cover and Thomas, 1991), H(x’ | x) < H(x | x) for any pair of x and x’. On the other hand, 
according to the MEM algorithm, 


KD agma Spal) los fe) 
k=1 
K 
= argmax È peal foam S not ))log f(x) 
x 


k=1 k=1 
| 


= argmax Q(x | x"). 


Hence, Q(x") | x) > Q(x”) | x). Finally, we prove the ascending property: 


Leal) = QHD |x) Ha |x) > OM |) -HGO |x) = 10). 


Appendix B. 


Recall that the Ridgeline EM algorithm aims at maximizing logg(x | a) = (1 — a)loggi(x) + 
aloggz(x), where 0 < & < 1 and g; (x) and g2(x) are two mixture densities g;(x) = X11 Tixhix(x), 
i= 1,2. We prove here the ascending property of this algorithm, as described in Section 4.1. 

Following the definitions in Appendix A, we form functions L;(x), Q;(x' | x), and H;(x'|x) 
for densities g;(x), i= 1,2, respectively. Specifically, L;(x) = log g;(x), Q(x’ | x) = Zi pix(x) 
log Tt; /ixn(x’), and H;(x’ |x) = Z1 pix(x) log pix(x’), where pj x(x) = Tixhix(x)/gi(x). Note that, 
based on the proof in Appendix A, we have L;(x’) = Q;(x’ | x) — H;(x' | x) and H;(x | x) > H;(x’ | x). 
We now define 


L(x) = (1—-a)Li(x) + L(x) , 
O(x |x) = (1-0) | x) + 0Q2(x' | x) , 
A(x’ |x) = A-a)H (x | x)+H2(x' |x). 


According to the Ridgeline EM algorithm, x(t!) = argmax, Õ(x' | i 0). Hence Q(x") | 
0) > Õ(x | x). Also, it is obvious that A(x | x”) > A(x) | x). Finally, we prove 
the ascending property: 


Leal) = OECD |x) ARD |) > GEO | x1) — AG [2) = 1G), 
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Appendix C. 


We prove here the graph constructed in Section 4.2, where every clique corresponds to a node, has 
no loop. We denote a node (i.e., a clique) by c;. By construction, a directed edge from c; to cj exists 
if the following conditions are satisfied: 


1. Sc(ci,c;) <€. 


2. Sc(ci,cj) = mingzi Se(Ci,Ck) and j is the smallest index among all those j’’s that achieve 
Selci, cj) = mingz;So(ci,cx). 


3. lci) < 8(c;). 


We refer to the three conditions as Condition 1, 2, 3. 

We prove the non-existence of loops by contradiction. We will first show that if there is a loop in 
the graph, this loop is directed. Then, we will prove that a directed loop cannot exist. Without loss 
of generality, assume that there is a loop connecting nodes {c,C2,...,c,} sequentially. The edges 
in the loop are {@1,7, 2,3, ..-,€x—1,k, k,1}, Where e; j connects c; and cj. Let head(e;,;+1) indicate the 
node from which edge e;;+1 starts. Obviously, head (e; i41) = Ci Or Ci+1. 

Without loss of generality, let head (ex1) = c. By Condition 2, every node can have almost 
one edge starting from it. Hence head(ek—1 k) = cx—1. For an arbitrary j > 1, assume that for 
i= j,j+1,...,k— 1, we have head(e; +1) = ci. Since head(e;,;+1) = cj, again by Condition 2, 
head(ej-1,j) = cj-1. Thus, for i= j—1,j,...,.4—1, we have head(ejj,1) = ci. By induction, for 
any i = 1,...,k— 1, we have head(eii1) = ci. Therefore, the loop connecting {c1,c2,...,cx} is 
directed. 

By Condition 3 of the graph construction procedure, if head(e;,;) = ci, then 6; < 6;. Thus, if 
there is a directed loop connecting nodes {c1,c2,...,cx} and head(e;j+1) = cj, we get the contradic- 
tion: 6; < & <--- < Ò < 6). This proves that there is no loop (regardless of directed or not) in the 
graph. 
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